abline() for vertical lines, legend() for text labels)curve(), legend())segments(), area under the curve with polygon())polygon(), annotations with arrows())axis())TODOs:
This notebook contains code and figures for statistics and data science education. It stems from a statistics course I gave a the HTW Berlin. All figures are generated with base R plotting functions and no additional packages are required. This notebook may be useful for people in statistics or data science education and for those who want to learn more about (advanced) base R plotting and mathematical annotations in R plots. Each section title below mentions in parentheses new techniques or concepts which are introduced in the section. I chose base R graphics because I find the plots very elegant and especially useful for function plotting.
I start with the binomial distribution and since it is a discrete probability distribution, I use a bar plot via barplot(). I find the formula notation best when using this function. You simply provide a vector of x and y values and a notation like y ~ x will hence plot bars of height \(y_i\) for each corresponding \(x_i\).
I also use mathematical notation in the y-axis label and the plot title. Adding mathematical notation to plots in R is quite a hassle, as R basically requires you to provide the notation in some crude mixture between R and LaTeX notation. The syntax is described on the R documentation page ?plotmath and can be showcased using demo(plotmath) in the console.
Pure mathematical expressions can be plotted via expression(...) as seen below for the y-axis label. To mix text and mathematical notation, use paste() inside expression(). For unknown reasons, paste() will not produce a space between the text and the mathematical expression, i.e. it works like paste0() when used within expression().1 Also for unknown reasons, the \(\sim\)-operator %~% will add a space before, but not after the operator. You will need to use ~ to add a space after the operator.
I don’t find the typesetting as elegant as LaTeX, but it works for most scenarios.
p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)
barplot(y ~ x,
xlab = "x", ylab = expression(P(X == x)),
main = expression(paste("Binomial distribution ", X %~% ~ B(10, 0.2))))
It’s not ideal to hard code concrete values for \(n\) and \(p\) in the plot title. In order to pass R variables into mathematical expressions, you can use substitute(<expr>, <vars>) instead of expression(). The first argument provides the mathematical expression with placeholders like n and p in the code below. The second argument must be a list that maps the placeholders to actual values. Note that if you pass a vector instead of a scalar value using <vars>, only the first value will be taken from the vector.
p <- 0.2
x <- 0:10
n <- 10
y <- dbinom(x, n, p)
barplot(y ~ x,
xlab = "x", ylab = expression(P(X == x)),
main = substitute(paste("Binomial distribution ", X %~% ~ B(n, p)), list(n=n, p=p)))
abline() for vertical lines, legend() for text labels)The following code shows an example of a skewed distribution using the log-normal distribution. To draw a smooth curve, I generate linearly spaced values \(x_i\) using seq() and calculate the respective \(y_i\) value with the dlnorm() function. The function is plotted using a line plot (plot() with parameter type="l").
Annotations for mean and median are added as vertical lines using abline() with v (for vertical) parameter and a helper function that uses legend(). One could also add text annotations to the plot via text(), but only legend() provides a nice box around the text.
You may notice that the font in the plot title is now different than in the above plots (bold below and normal above). This is due to an (unfortunate) default setting in R – mathematical expressions are plotted by default in normal font and titles without mathematical expressions are plotted in bold by default. You can either make plot titles with expressions bold by using bold(<expr>) inside expression() / substitute() or make all plot titles appear in normal font by passing the graphical parameter font.main = 1 to the respective plot function.
x <- seq(0, 3, by = 0.001)
unit <- 100000
m <- 0.6
s <- 0.08
y <- dlnorm(x, log(m))
y <- y/sum(y)
mu <- sum(x * y * unit)
med <- max(x[cumsum(y) <= 0.5]) * unit
addlabel <- function(x, y, label, col, xjust = 0.05) {
legend(x, y, label, text.col = col, box.col = "lightgray", xjust = xjust, x.intersp = 0)
}
plot(x * unit, y, type = "l",
xlab = "Income in €", ylab = "Density",
main = "A fictitious but typical income distribution")
abline(v = c(med, mu), lty = "dashed", col = c("red", "blue"))
addlabel(med, 0.001, paste("Median:", med, "€"), "red")
addlabel(mu, 0.0008, paste("Mean:", round(mu), "€"), "blue")
curve(), legend())The following plots illustrate the transition from binomial to normal distribution when increasing \(n\). I draw graphics for increasing values of \(n\) in a for-loop2 and additionally overlay the final plot with the corresponding normal distribution density function. I’m also using vertical lines via plot() with type="h" here instead of a bar plot, since the latter causes overplotting at high \(n\). For each different \(n\), I’m setting some graphical properties using the xlim and lwd lists. curve() is used for plotting the density function. It’s very similar to a line plot with plot() and type="l", but you directly pass the function that should be plotted and it takes care about generating the right \(x_i\) values for it.
Usually it’s better to use ggplot2 or lattice when generating multiple graphics, i.e. as facets or small multiples. In this case, however, I need these plots as separate figures to be included in presentation slides.
p <- 0.5
xlim <- list(
"10" = c(0, 10),
"100" = c(35, 65),
"1000" = c(450, 550)
)
lwd <- list(
"10" = 5,
"100" = 3,
"1000" = 1
)
lastn <- 0
for (n in c(10, 100, 1000, 1000)) {
x <- 0:n
y <- dbinom(x, n, p)
k <- as.character(n)
plot(x, y, type = "h", lwd = lwd[[k]],
xlab = 'x – Number of "heads"', ylab = expression(P(X == x)),
main = sprintf("Binomial distribution for %d coin flips", n),
xlim = xlim[[k]])
if (lastn == 1000) { # add the curve for the normal distribution
mu <- n*p
sigma <- n*p/sqrt(n)
f <- function(x) { dnorm(x, mean = mu, sd = sigma) }
curve(f, xlim[[k]][1], xlim[[k]][2], add = TRUE, col = "red")
legend(510, 0.025, substitute(paste("Normal distribution ", N(mu, sigma)),
list(mu=mu, sigma=round(sigma, 2))),
lty = "solid", col = "red", cex = 0.75)
}
lastn <- n
}
segments(), area under the curve with polygon())The following plot uses waiting time (waiting for a bus, waiting at the doctor’s, etc.) as an example for a continuous uniform distribution. segments() can be used to draw line segments from \(x_1, y_1\) to \(x_2, y_2\).
a <- 0
b <- 10
y <- 1/(b-a)
plot(c(a, b), c(y, y), xlim = c(a-1, b+1), ylim = c(0, y*1.1), pch = 16,
xlab = "Waiting time in minutes", ylab = "Density",
main = "Density function of the waiting time")
segments(a-1, 0, a, 0)
segments(a, 0, a, y, lty = "dashed")
segments(a, y, b, y)
segments(b, y, b, 0, lty = "dashed")
segments(b, 0, b+1, 0)
To highlight a specific probability I draw shaded areas with polygon(). The coordinates of the vertices are passed as separate \(x\) and \(y\) vectors (a list with \(x\) and \(y\) components or a two-column matrix are also possible). The polygon is automatically closed, i.e. the last point and the first point are joined.
Please also note the usage of ~ in the mathematical expression used in substitute(), which is necessary because of the inequality \(a \le X \le b\). For reasons unknown to me, R will not accept an expression like P(a <= X <= b) in expression() / substitute() and will answer with Error: unexpected '<=' in "expression(P(a <= X <=". The somewhat surprising solution is to add a ~-operator before the X.
a <- 0
b <- 10
y <- 1/(b-a)
x1 <- 7.5
x2 <- 10
plot(c(a, b), c(y, y), xlim = c(a-1, b+1), ylim = c(0, y*1.1), pch = 16,
xlab = "Waiting time in minutes", ylab = "Density",
main = "Density function of the waiting time")
polygon(c(x1, x1, x2, x2), c(0, y, y, 0), col = "lightgray", border = FALSE)
text(x1 + (x2 - x1) / 2, y/2, substitute(P(a <= ~X <= b), list(a=x1, b=x2)), cex = 0.75)
segments(a-1, 0, a, 0)
segments(a, 0, a, y, lty = "dashed")
segments(a, y, b, y)
segments(b, y, b, 0, lty = "dashed")
segments(b, 0, b+1, 0)
The following code again generates figures intended to be used in slides. I step-by-step plot the density function of the normal distribution with different values for \(\mu\) and \(\sigma\) using a nested for-loop. The parameters and different line styles are taken from lists. Each figure is first prepared in the outer for-loop as empty canvas using plot(NULL, ...) which sets limits for the axes and the labels. In the inner for-loop, the density functions are added using curve() with parameter add = TRUE. The legend labels and styles are collected in the inner loop and finally the legend is drawn for each figure via legend().
param <- list(
c(0, 1),
c(0, 0.5),
c(0, 2),
c(2, 1),
c(-1, 0.4)
)
styles <- list(
c("solid", "darkred"),
c("solid", "coral"),
c("solid", "red"),
c("dashed", "darkred"),
c("dotted", "orange")
)
for (maxshow in 1:length(param)) { # outer loop: number of curves to show
# create empty plot
plot(NULL, xlim = c(-5, 5), ylim = c(0, 1.0),
main = "Normal distribution with different parameters",
xlab = "x",
ylab = "f(x)")
legend_labels <- character()
legend_lty <- character()
legend_col <- character()
for (i in 1:maxshow) { # inner loop: draw up to `maxshow` curves
p <- param[[i]]
m <- p[1]
s <- p[2]
k <- sprintf("m%.1f_s%.1f", m, s)
linest <- styles[[i]]
f <- function(x) { dnorm(x, mean = m, sd = s) }
curve(f, lwd = 3, lty = linest[1], col = linest[2], n = 1001, add = TRUE)
legend_labels <- append(legend_labels, sprintf("N(%.1f, %.1f)", m , s))
legend_lty <- append(legend_lty, linest[1])
legend_col <- append(legend_col, linest[2])
}
legend("topright", legend_labels, lty = legend_lty, col = legend_col)
}
polygon(), annotations with arrows())This plot visualizes the 68-95-99.7 rule: within one, two and three standard deviations of the mean lie about 68%, 95% and 99.7% of the values in a normal distribution. I hence plot the areas under the standard normal distribution for different intervals of standard units \([-z, z]\) with \(z \in \{1, 2, 3\}\).
For shading the areas, I again use polygon. This time though, constructing the \(x\) and \(y\) coordinates for the vertices requires a little more effort. In the code below x and y refer to \(x\) and \(f(x)\) for the density function, respectively. I first construct a boolean vector iv that indicates which coordinates from x and y are inside the interval \([-z, z]\). We can refer to these values by x[iv] and y[iv] respectively – these are the parts of the density function between \(-z\) and \(z\) and form the upper edge of the polygon. However, to shade the full area we need to extend the coordinates vector for the bottom points at \((-z, 0)\) and \((z, 0)\) and hence provide c(-z, x[iv], z) for the \(x\) coordinates and c(0, y[iv], 0) for the \(y\) coordinates of the polygon. The order of these values is important, so that the last and first point again can be joined for closing the polygon.
In order to give these areas different colors, I could again use a list of different styles. This time however, I’m simply using a semi-transparent color "#aaaaaa50" where the last color component 50 sets the alpha channel to \(5 \cdot 16/256 \approx 31\%\) transparency. Since areas with smaller \(z\) are covered by areas with higher \(z\), the semi-transparent gray adds up to darker tones for smaller \(z\).
Finally, I’m using abline() again for vertical dashed lines and arrows() to depict the probabilities in each interval.
# plot the standard normal density function
x <- seq(-3.5, 3.5, length = 1001)
y <- dnorm(x)
plot(x, y, ylim = c(0, 0.6), type = "l",
xlab = "z", ylab = "f(z)", main = "Standard normal distribution N(0, 1)")
for (z in 1:3) { # iterate through the standard units z
# add semi-transparent shaded area in [-z, z]
iv <- x >= -z & x <= z
polygon(c(-z, x[iv], z),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
# add vertical lines at -z and z
abline(v = c(-z, z), lty = "dashed", col = paste0("#000000", c("FF", "AA", "66")[z]))
# add arrow and text label annotations to depict the probability
y_segm <- 0.38 + 0.055 * z
arrows(-z, y_segm, z, y_segm, code = 3, length = 0.1)
p <- round((1-pnorm(-z) * 2) * 100,
ifelse(z < 3, 0, 1))
text(0, y_segm, paste0(p, "%"), adj = c(0.5, -0.2))
}
axis())This is the same plot as above, but this time it shows a general normal distribution instead of the standard normal. I’m using axis() to provide mathematical expressions in the x-axis labels.
x <- seq(-3.5, 3.5, length = 1001)
y <- dnorm(x)
plot(x, y, ylim = c(0, 0.6), type = "l",
xlab = "x", ylab = "f(x)",
main = expression(paste("Normal distribution ", N(mu, sigma))),
xaxt = "n")
for (z in 1:3) {
iv <- x >= -z & x <= z
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
abline(v = c(-z, z), lty = "dashed", col = paste0("#000000", c("FF", "AA", "66")[z]))
y_segm <- 0.38 + 0.055 * z
arrows(-z, y_segm, z, y_segm, code = 3, length = 0.1)
text(0, y_segm, paste0(round((1-pnorm(-z) * 2) * 100, 1), "%"), adj = c(0.5, -0.2))
}
xticks <- -3:3
axis(1, at = -3:3, labels = expression(mu - 3 * sigma,
mu - 2 * sigma,
mu - sigma,
mu,
mu + sigma,
mu + 2 * sigma,
mu + 3 * sigma))
Students often confuse the density function \(f(x)\) (aka probability density function – pdf), the distribution function \(F(x)\) (aka cumulative distribution function – cdf), the quantile function \(F^{-1}(x)\) and their respective counterparts in R. I therefore always depict these functions, explain how they related to each other and how they can be used in R.
x <- seq(-4, 4, length = 1000)
y <- dnorm(x)
plot(x, y, type = "l",
main = "Density function f(x) of the standard normal distribution:\ndnorm(x)",
xlab = "x", ylab = "f(x)")
x <- seq(-4, 4, length = 1000)
y <- pnorm(x)
plot(x, y, type = "l",
main = "Distribution function F(x) of the standard normal distribution:\npnorm(x)",
xlab = "x", ylab = "F(x)")
This plot of the distribution function is used to introduce the quantile function, which some students have problems to comprehend.
x <- seq(-4, 4, length = 1000)
y <- pnorm(x)
plot(x, y, type = "l",
main = "Distribution function F(x) of the standard normal distribution:\npnorm(x)",
xlab = "x", ylab = "F(x)")
p <- 0.9
q <- qnorm(0.9)
segments(-4, p, q, p, lty = "dashed", col = "red")
segments(q, p, q, 0, lty = "dashed", col = "red")
text(-3.5, p, paste("F(x) =", p), pos = 3, col = "red")
text(q, 0, paste("For which x?"), pos = 4, col = "red")
This plot of the quantile function picks up the question about solving for \(x\) in \(F(x) = 0.9\) from the previous plot. Note how you have to write \(F^{-1}(x)\) as F^{-1} * (x). The * is required to juxtapose \(F^{-1}\) and \((x)\) without horizontal spacing (using ~ instead would introduce spacing).
x <- seq(0, 1, length = 1000)
y <- qnorm(x)
p <- 0.9
q <- qnorm(0.9)
plot(x, y, type = "l",
main = expression(paste("Quantile function ", F^{-1} * (x),
" of the standard normal distribution: qnorm(x)")),
xlab = "x", ylab = expression(F^{-1} * (x)))
segments(0, q, p, q, lty = "dashed", col = "red")
segments(p, q, p, -3, lty = "dashed", col = "red")
text(p, q + 0.4, substitute(F^{-1} * (p) %~~% q, list(p=p, q=round(q, 2))), col = "red")
After density, distribution and quantile function, the list of functions provided by R for each distribution family can be completed with the random number generation function. As an example, I’m using rnorm() to generate 1000 values x from the standard normal distribution. I’m using density() to approximate the density function of the generated points. Then, I’m constructing max_y as a vector that contains the maximum \(y\) coordinate for each value in x so that it is below the density function curve. The density curve is plotted via plot() and the random points are added via points(). Here I’m placing each point in x on a random \(y\) coordinate between 0 and the respective value in max_y so that it will always be drawn below the density curve. This works since we can pass a vector like max_y to the max parameter (upper limit of the uniform distribution) of runif() so that for each coordinate in x we have the correct upper limit for the random \(y\) coordinate in order to be placed underneath the curve. Again, I’m using semi-transparent color – this time for the points by setting the fourth parameter in rgb() to 0.5.
set.seed(20231206)
x <- rnorm(1000)
n <- length(x)
d <- density(x)
# for each generated x_i, find the corresponding value of the density curve; this allows to generate a random
# y-coordinate that fits under the density curve
max_y <- sapply(x, function(x_i) {
d$y[min(which(x_i < d$x))]
})
plot(d, main = paste(n, "randomly sampled values from the standard normal distribution"),
xlab = "x", ylab = "Density")
points(x, y = runif(n, 0, max_y), col = rgb(0, 0, 0, 0.5))
The following figures depict sampling distributions and show practical implications of the central limit theorem (CLT).
dotchart(), mathematical expressions, legends)For the following plots I will draw random samples from data about rents in a Germany city. The rent data is left-skewed as we can see in the following plot. For educational reasons, the data is considered to be the “population” from which the samples are drawn. The mean rent from the data is considered the “population mean” \(\mu\).
set.seed(20231024)
d <- read.table('data/miete03.asc', header = TRUE)
rent <- d$nm
mu <- mean(rent)
n <- length(rent)
hist(rent, main = "Histogram of the rent", freq = FALSE,
sub = paste("n =", n), xlab = "Rent in €", ylab = "Probability")
abline(v = mu, lty = 'dashed', col = 'red')
text(mu, 0.0017, expression(mu), adj = c(-0.5, 0), col = 'red')
I now draw \(m=10\) random samples from the population, each consisting of \(n=30\) data points. For each of the \(m\) samples, the mean \(\hat \mu_i\) is calculated and depicted in a dot plot via dotchart() together with the mean \(\overline{\hat \mu}\) of all samples.
n <- 30
m <- 10
hat_mu_i <- sapply(1:m, function(i) {
mean(sample(rent, n))
})
dotchart(hat_mu_i, labels = as.character(1:m), xlab = "Rent in €", ylab = "Sample",
main = substitute(
paste("Sample means ", hat(mu)[i], " for ", m, " samples of size n=", n),
list(m=m, n=n)
))
abline(v = mu, lty = 'dashed', col = 'red')
text(mu, 0.5, expression(mu), adj = c(-0.5, 0), col = 'red')
mean_hat_mu <- mean(hat_mu_i)
abline(v = mean_hat_mu, lty = 'dashed', col = 'blue')
text(mean_hat_mu, 0.5, expression(bar(hat(mu))), adj = c(-0.5, 0), col = 'blue')
The following plot then draws much more random samples (\(m=10,000\)) and shows the resulting sample means as histogram. The plot is annotated with the (tiny) absolute error \(|\overline{\hat \mu} - \mu|\) showing that the sample means are distributed around the population mean \(\mu\).
n <- 30
m <- 10000
hat_mu_i <- sapply(1:m, function(i) {
mean(sample(rent, n))
})
hist(hat_mu_i, freq = FALSE, xlab = "Rent in €", ylab = "Probability",
main = substitute(
paste("Histogram of sample means for ", m, " samples of size n=", n),
list(m=m, n=n)
)
)
abline(v = mu, lty = "solid", col = 'red')
text(mu, 0, expression(mu), adj = c(-0.5, 0), col = 'red')
mean_hat_mu <- mean(hat_mu_i)
abline(v = mean_hat_mu, lty = 'dashed', col = 'blue')
text(mean_hat_mu, 0.0005, expression(bar(hat(mu))), adj = c(-0.5, 0), col = 'blue')
e <- round(abs(mean_hat_mu - mu), 2)
text(mean_hat_mu, 0.0085, expression(abs(paste(" ", bar(hat(mu)) - mu, " "))), adj = c(-1, 0))
text(mean_hat_mu, 0.0085, paste0("= ", e, "€"), adj = c(-2.2, -0.2))
The next plot shows how random samples of different sizes \(n\) are distributed. One can see that for low \(n\), the left-skew of the population is also reflected in the sampling distribution, but as \(n\) is increased, this skew disappears. This shows the importance of using a sufficient sample size when suspecting a skewed population distribution.
Each plot is also drawn on the same x-axis scale, so that students see how the standard deviation of the sampling distribution (i.e. the standard error) reduces with increasing sample size \(n\). I also collect the empirical standard errors calculated from the samples in the vector se_per_samplesize to plot them in the next figure.
m <- 10000
se_per_samplesize <- numeric()
samplesizes <- c(15, 30, 100, 1000)
for (n in samplesizes) {
hat_mu_i <- sapply(1:m, function(i) {
mean(sample(rent, n))
})
hist(hat_mu_i, freq = FALSE, xlab = "Rent in €", ylab = "Probability",
main = substitute(
paste("Histogram of sample means for ", m, " samples of size n=", n),
list(m=m, n=n)
),
breaks = seq(200, 940, by = 5)
)
se_per_samplesize <- c(se_per_samplesize, sd(hat_mu_i))
}
The notion of decreasing the standard error and hence increasing the precision of an estimate by increasing the sample size can then be formalized for the mean \(\overline X\) for iid. samples as \(\sigma_{\overline X} = \frac \sigma {\sqrt n}\). The points in the plot reflect the collected standard errors from the samples se_per_samplesize for each \(n\) in samplesizes. The red dashed line is calculated from the formula for the standard error of sample mean.
plot(samplesizes, se_per_samplesize, type = "p",
main = "Estimation error of the rent by sample size",
xlab = "Sample size",
ylab = "Standard error in €")
nrange <- min(samplesizes):max(samplesizes)
se_theoretic <- sd(rent) / sqrt(nrange)
lines(nrange, se_theoretic, lty = 'dashed', col = 'red')
legend("topright", c("Empirical standard error", "Theoretic standard error"),
pch = c(1, NA_integer_), lty = c(NA_integer_, 2), col = c("black", "red"))
According to the CLT, the mean \(\overline X\) of an iid. sample is normally distributed with expected value \(\mu\) and standard deviation \(\sigma_{\overline X}\) (standard error). We can reuse the plot of the normal distribution with 68-95-99.7-rule and apply it for this sampling distribution.
x <- seq(-3.5, 3.5, length = 1001)
y <- dnorm(x)
plot(x, y, ylim = c(0, 0.6), type = "l",
xlab = "x", ylab = "f(x)",
main = expression(paste("Distribution of the sample mean ", bar(X) %~% ~ N(mu, sigma[bar(X)]))),
xaxt = "n")
for (z in 1:3) {
iv <- x >= -z & x <= z
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
abline(v = c(-z, z), lty = "dashed", col = paste0("#000000", c("FF", "AA", "66")[z]))
y_segm <- 0.38 + 0.055 * z
arrows(-z, y_segm, z, y_segm, code = 3, length = 0.1)
text(0, y_segm, paste0(round((1-pnorm(-z) * 2) * 100, 1), "%"), adj = c(0.5, -0.2))
}
xticks <- -3:3
axis(1, at = -3:3, labels = expression(mu - 3 * sigma[bar(X)],
mu - 2 * sigma[bar(X)],
mu - sigma[bar(X)],
mu,
mu + sigma[bar(X)],
mu + 2 * sigma[bar(X)],
mu + 3 * sigma[bar(X)]))
To stress again the importance of a sufficient sample size, the following plots sample from a highly left-skewed distribution and depict how the sampling distribution picks up this asymmetry so that it is even visible for a sample size of \(n=100\).
unit <- 100000
mean <- 0.6
x <- seq(0, 3, by = 0.01)
d <- dlnorm(x, log(mean)) * unit
d <- d[d < 100000]
hist(d, breaks = seq(0, 1.2, by = 0.1) * unit, probability = TRUE,
main = "Fictitious income distribution",
xlab = "Income in €",
ylab = "Density")
m <- 10000
for (n in c(10, 100, 1000)) {
means <- sapply(1:m, function(i) {
X <- rlnorm(n, log(mean)) * unit
mean(X)
})
means <- means[means < 250000]
hist(means, probability = TRUE, breaks = seq(0, 2.5, by = 0.02) * unit,
main = paste("Distribution of the sample mean of incomes with n =", n),
xlab = "Sample mean of incomes",
ylab = "Density")
}
set.seed(20231111)
d <- read.table('data/miete03.asc', header = TRUE)
rent <- d$nm
mu <- mean(rent)
sigma <- sqrt(mean((mu-rent)^2))
m <- 100 # number of draws
n <- 30 # sample size
sample_conf_interval <- function(i, data, n, sigma, z) {
X <- sample(data, size = n)
Xbar <- mean(X)
E <- z * sigma/sqrt(n)
c(Xbar - E, Xbar + E)
}
for (gamma in c(0.8, 0.95, 0.99)) { # confidence levels
alpha <- 1 - gamma # significance level
z <- qnorm(1-alpha/2)
conf_ints <- t(sapply(1:m, sample_conf_interval, rent, n, sigma, z))
plot(NULL, xlim = c(300, 850), ylim = c(1, m+1),
main = sprintf("%d%%-confidence intervals for %d random samples (n=%d)\nof the rent dataset",
round(gamma*100), m, n),
xlab = "Rent in €",
ylab = "Sample draw")
abline(v = mu, col = "red", lty = "dashed")
text(mu, 100, expression(mu), col = "red", adj = c(-0.5, -0.5))
mu_covered <- conf_ints[,1] < mu & mu < conf_ints[,2]
conf_int_col <- ifelse(mu_covered, "black", "red")
segments(conf_ints[,1], 1:m, conf_ints[,2], 1:m, col = conf_int_col)
legend("topright", c(expression(paste("Confidence interval covers ", mu)),
expression(paste("Confidence interval doesn't cover ", mu))),
lty = "solid",
col = c("black", "red"))
}
curve(dnorm, lwd = 2, n = 1001, from = -5, to = 5,
lty = "dashed",
main = expression(paste(italic(t), " distributions and the standard normal distribution")),
xlab = "x",
ylab = "f(x)")
nvals <- c(2, 5, 15)
legend_labels <- c("N(0, 1)")
legend_col <- c(gray(0))
legend_lty <- c("dashed")
for (i in 1:length(nvals)) {
n <- nvals[i]
col <- gray(1-0.2*i)
f <- function(x) { dt(x, n) }
curve(f, lwd = 2, n = 1001, from = -5, to = 5, add = TRUE, col = col)
legend_labels <- c(legend_labels, sprintf("t(%d)", n))
legend_col <- c(legend_col, col)
legend_lty <- c(legend_lty, "solid")
}
legend("topright", legend_labels, col = legend_col, lty = legend_lty)
mu <- 17
X <- c(16.4, 19.4, 17.8, 16.4, 18.5, 19.0, 19.5, 17.1, 16.1)
Xbar <- mean(X)
n <- length(X)
sigma <- 1.5 / sqrt(n)
z <- qnorm(c(0.025, 0.975), mean = Xbar, sd = sigma)
confint <- paste0(round(z, 2), "cm")
plot(X, runif(n), ylim = c(-0.5, 1.5), yaxt = "n",
ylab = "", xlab = "Pencil length in cm",
main = "Random sample of pencil lengths and 95% confidence interval")
abline(h = 0.5)
abline(v = mu, lty = "dashed")
text(mu, -0.45, expression(mu), adj = -0.5)
segments(z[1], -0.1, z[2], -0.1, col = "red")
segments(z[1], -0.05, z[1], -0.15, col = "red")
segments(z[2], -0.05, z[2], -0.15, col = "red")
segments(Xbar, -0.05, Xbar, -0.15, col = "red")
text(c(z, Xbar), 0.075, c(confint, paste0(Xbar, "cm")), col = "red")
text(Xbar, -0.45, expression(bar(X)), col = "red")
X <- c(16.4, 19.4, 17.8, 16.4, 18.5, 19.0, 19.5, 17.1, 16.1)
Xbar <- mean(X)
mu <- 17
n <- length(X)
sigma <- 1.5 / sqrt(n)
z <- qnorm(c(0.025, 0.975), mean = mu, sd = sigma)
p <- 1-pnorm(Xbar, mu, sigma)
x <- seq(15, 19, length = 1001)
y <- dnorm(x, mu, sigma)
plot(x, y, type = "l", xlab = "Pencil length in cm", ylab = "f(x)",
main = expression(paste("Distribution of ", bar(X), " around ", mu, "=17cm under ", H[0],
" (", sigma, "=1.5cm and n=9)")))
iv <- x < z[1]
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
iv <- x > z[2]
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
abline(v = z, lty = "dashed", col = "#00000066")
abline(v = mu, lty = "dashed", col = "#000000DD")
text(mu, 0, expression(mu), adj = -0.25)
iv <- x >= Xbar
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#ff000050", border = FALSE)
abline(v = Xbar, lty = "dashed", col = "red")
text(Xbar, 0.15, substitute(P(bar(X) >= Xbar) == p,
list(Xbar=Xbar, p=sprintf("%.1f%%", p*100))),
adj = -0.02, col = "red")
text(c(mu, z[2]), 0.035, c("95%", "2.5%"), adj = -0.25)
text(z[1], 0.035, "2.5%", adj = 1.25)
X <- c(16.4, 19.4, 17.8, 16.4, 18.5, 19.0, 19.5, 17.1, 16.1)
Xbar <- mean(X)
mu <- 17
n <- length(X)
sigma <- 1.5 / sqrt(n)
x <- seq(15, 19, length = 1001)
y <- dnorm(x, mu, sigma)
z <- qnorm(c(0.025, 0.975), mean = Xbar, sd = sigma)
konfint <- paste0(round(z, 2), "cm")
plot(x, y, type = "l", xlab = "Pencil length in cm", ylab = "f(x)",
main = expression(paste("Distribution of ", bar(X), " around ", mu, "=17cm under ", H[0],
" (", sigma, "=1.5cm and n=9)")))
abline(v = mu, lty = "dashed")
text(mu, 0.1, expression(mu), adj = -0.5)
segments(z[1], -0.015, z[2], -0.015, col = "red")
segments(z[1], -0.01, z[1], -0.02, col = "red")
segments(z[2], -0.01, z[2], -0.02, col = "red")
segments(Xbar, -0.01, Xbar, -0.02, col = "red")
text(c(z, Xbar), 0.025, c(konfint, paste0(Xbar, "cm")), col = "red")
text(Xbar, 0.07, expression(bar(X)), col = "red")
x <- seq(-4, 4, length = 1001)
y <- dnorm(x)
plot(x, y, type = "l", xlab = "", ylab = "Density",
main = expression(paste("One-sided test: Alternative hypothesis is ", mu < mu[0])),
xaxt = "n")
p <- 0.3
z <- qnorm(p)
abline(v = 0, lty = "dashed")
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))
iv <- x < z
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
text(-1.5, 0.01, expression(P(bar(X) <= bar(x))))
plot(x, y, type = "l", xlab = "", ylab = "Density",
main = expression(paste("One-sided test: Alternative hypothesis is ", mu > mu[0])),
xaxt = "n")
abline(v = 0, lty = "dashed")
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))
iv <- x > z
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
text(1.3, 0.01, expression(P(bar(X) >= bar(x))))
x <- seq(-4, 4, length = 1001)
y <- dnorm(x)
plot(x, y, type = "l", xlab = "", ylab = "Density",
main = expression(paste("Two-sided test: Sample mean is ", bar(x) < mu[0])),
xaxt = "n")
p <- 0.3
z <- qnorm(p)
abline(v = 0, lty = "dashed")
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))
iv <- x < z
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
text(-1.5, 0.01, expression(P(bar(X) <= bar(x))))
x <- seq(-4, 4, length = 1001)
y <- dnorm(x)
plot(x, y, type = "l", xlab = "", ylab = "Density",
main = expression(paste("Two-sided test: Sample mean is ", bar(x) > mu[0])),
xaxt = "n")
abline(v = 0, lty = "dashed")
z <- qnorm(1-p)
segments(z, 0, z, dnorm(z), lty = "dashed")
axis(1, c(0, z), labels = c(expression(mu[0]), expression(bar(x))))
iv <- x > z
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#aaaaaa50", border = FALSE)
text(1.3, 0.01, expression(P(bar(X) >= bar(x))))
n <- 100
p <- 0.1
k <- 12
p_less_k <- round(pbinom(11, 100, 0.1), 3)
p_ge_k <- 1 - p_less_k
subst_params <- list(p_less_k=p_less_k, p_ge_k=p_ge_k)
x <- 0:100
y <- dbinom(x, n, p)
barcols <- ifelse(x >= k, "orange", "gray")
barplot(y ~ x, axis.lty = 1, col = barcols,
xlab = "Number x of waste items",
ylab = "Probability P(X=x)",
main = "Probability of number of waste items under the null hypothesis")
legend("topright",
c(expression(paste("Number of waste items " < 12)),
expression(paste("Number of waste items " >= 12))),
fill = c("gray", "orange"))
text(34, 0.04, substitute(P(X >= 12) == p, list(p = p_ge_k)), adj = 0)
arrows(33, 0.039, 20, 0.03, length = 0.1)
x <- 0:100
y <- pbinom(x, n, p)
barcols <- ifelse(x == k - 1, "red", "gray")
barplot(y, names.arg = as.character(x), axis.lty = 1,
col = barcols,
xlab = "Number x of waste items",
ylab = expression(paste("Probability ", P(X <= x))),
main = "Probability of number of waste items under the null hypothesis")
legend("topright",
expression(paste("Probability ", P(X <= 11))),
fill = "red")
pgreduced <- PlantGrowth[PlantGrowth$group != "trt1", ]
pgreduced$group <- factor(pgreduced$group)
boxplot(weight ~ group, data = pgreduced,
xlab = "Group", ylab = "Dried weight of plants in kg",
main = "Plant growth experiment")
ctrl <- pgreduced$group == "ctrl"
w_ctrl <- pgreduced$weight[ctrl]
w_trt2 <- pgreduced$weight[!ctrl]
xbar <- mean(w_trt2)
ybar <- mean(w_ctrl)
nx <- length(w_trt2)
ny <- length(w_ctrl)
se_x <- sd(w_trt2) / sqrt(nx)
se_y <- sd(w_ctrl) / sqrt(ny)
x <- seq(4, 6.5, length = 1001)
y_trt2 <- dnorm(x, mean = xbar, sd = se_x)
plot(w_trt2, runif(nx, 0, 0.1), col = "darkred", xlim = c(min(x), max(x)), ylim = c(0, 3),
xlab = "Dried weight of plants in kg", ylab = "Density",
main = "Plant growth experiment\nDistribution of sample means")
points(w_ctrl, runif(ny, 0, 0.1), col = "blue")
lines(x, y_trt2, col = "darkred")
abline(v = xbar, lty = "dashed", col = "darkred")
text(xbar, 0.15, substitute(bar(X) == xbar, list(xbar = paste0(xbar, "kg"))), pos = 4, col = "darkred")
y_ctrl <- dnorm(x, mean = ybar, sd = se_y)
lines(x, y_ctrl, col = "blue")
abline(v = ybar, lty = "dashed", col = "blue")
text(ybar, 0.15, substitute(bar(Y) == ybar, list(ybar = paste0(ybar, "kg"))), pos = 2, col = "blue")
legend("topright", c("X: treatment", "Y: control"),
lty = "solid", col = c("darkred", "blue"))
w_ttest_res <- t.test(w_trt2, w_ctrl, alternative = "greater")
se <- w_ttest_res$stderr
d <- w_trt2 - w_ctrl
d_tscale <- d / se
dbar <- mean(d)
dbar_tscale <- dbar / se
xmin <- floor(min(d_tscale)) - 1
xmax <- ceiling(max(d_tscale))
x <- seq(xmin, xmax, length.out = 1001)
y <- dt(x, w_ttest_res$parameter)
plot(x, y, type = "l", xaxt = "n",
xlab = expression(paste("Difference ", bar(X) - bar(Y), " in kg")),
ylab = "Density",
main = expression(paste("Distribution of the difference ", bar(X) - bar(Y),
" under ", H[0], ": ", mu[X] - mu[Y] <= 0)))
xaxt_labels <- seq(floor(xmin*se), ceiling(xmax*se), by = 0.5)
xaxt_labels_pos <- xaxt_labels/se
axis(1, at = xaxt_labels_pos, labels = xaxt_labels)
points(d_tscale, runif(length(d_tscale), 0, 0.01), col = "red")
abline(v = 0, lty = "dashed")
text(0, 0.025, expression(mu[X] - mu[Y] == 0), pos = 4)
abline(v = dbar_tscale, lty = "dashed", col = "red")
iv <- x >= dbar_tscale
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#ff000080", border = FALSE)
text(dbar_tscale + 0.25, 0.025, substitute(P(bar(X) - bar(Y) > dbar) == p,
list(dbar = paste0(dbar, "kg"),
p = round(w_ttest_res$p.value, 4))),
pos = 4, col = "red")
x <- seq(xmin, xmax, length.out = 1001)
y <- dt(x, w_ttest_res$parameter)
plot(x, y, type = "l",
xlab = "t",
ylab = "Density",
main = expression(paste(italic(t), " distribution under ", H[0], ": ", mu[X] - mu[Y] <= 0)))
points(d_tscale, runif(length(d), 0, 0.01), col = "red")
abline(v = 0, lty = "dashed")
text(0, 0.025, expression(mu[X] - mu[Y] == 0), pos = 4)
abline(v = dbar_tscale, lty = "dashed", col = "red")
iv <- x >= dbar_tscale
polygon(c(x[iv][1], x[iv], max(x[iv])),
c(0, y[iv], 0),
col = "#ff000080", border = FALSE)
text(dbar_tscale + 0.25, 0.025, substitute(1-P(t < tval) == pval,
list(dbar = paste0(dbar, "kg"),
tval = round(w_ttest_res$statistic, 4),
pval = round(w_ttest_res$p.value, 4))),
pos = 4, col = "red")
alpha <- 0.05
m <- 1:25
y <- 1-(1-alpha)^m
plot(m, y,
main = substitute(paste("Error rate for multiple tests with ", alpha==a), list(a=alpha)),
xlab = "Number of tests", ylab = "Probability of type I error")
lines(m, y)
Interestingly paste0() doesn’t work at all within expression().↩︎
Yes, there are situations where using a for-loop instead of vectorized operations, lapply(), etc. is fine and this is one of these situations since this loop doesn’t generate or update values, but only causes side effects (drawing plots).↩︎